Submitted by:

| # | Name | Email |
|---|---|---|
| Student 1 | Murad Ektilat | muradek@campus.technion.ac.il |
| Student 2 | Ariel Sernoff | arielsernoff@campus.technion.ac.il |
In this first homework assignment we'll familiarize ourselves with PyTorch as a
general-purpose tensor library with automatic gradient calculation capabilities.
We'll use it to implement some traditional machine-learning algorithms and remind ourselves about
basic machine-learning concepts such as different data sets and their uses, model hyperparameters, cross-validation,
loss functions and gradient derivation. We'll also familiarize ourselves with other highly useful
Python machine-learning packages such as NumPy, scikit-learn, and pandas.
You can of course use any editor or IDE to work on these files.
PyTorch
In this part, we'll learn about the Dataset and DataLoader classes, which are part of PyTorch's torch.utils.data package.
These are highly useful abstractions that can greatly reduce the amount of boilerplate code you need to write in order to work with data.
Knowing how to use these classes properly will prove useful in the coming assignments and course project.
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(42)
test = unittest.TestCase()
The Dataset class is an abstraction over a sequence of python objects,
each representing a sample (with or without a label). Its main purpose is
to load a single (possibly labeled) sample from some source (disk, web, etc.) into memory,
and transform it into a usable representation (e.g. image to tensor).
The Dataset abstracts away exactly when the data is loaded into memory: it can be on
demand when each sample is accessed, all in advance, or some combination using e.g. caching.
This is implementation-specific.
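For reference, a minimal map-style Dataset only needs `__getitem__` and `__len__`; here is an illustrative sketch over in-memory tensors (the class and variable names are hypothetical, not part of hw1):

```python
import torch
from torch.utils.data import Dataset

class TensorPairDataset(Dataset):
    """A minimal map-style Dataset over pre-loaded tensors (illustrative only)."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __getitem__(self, index):
        # "Loading" happens here: on demand, per sample.
        return self.X[index], self.y[index]

    def __len__(self):
        return len(self.X)

ds = TensorPairDataset(torch.randn(10, 3), torch.zeros(10, dtype=torch.long))
x, y = ds[0]
```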
As a warm-up, let's create a demonstration Dataset that returns noise images. It should:
- Return images of shape (C, W, H) containing random contents.
- Return labels that are random integers between 0 and num_classes-1.
- Return pixel values in the range [0, 255].
- Contain num_samples labeled images.

First, let's implement a simple function to generate a labelled random image.
TODO Implement the random_labelled_image function in the hw1/datasets.py module.
Use the code below to test your implementation.
import hw1.datasets as hw1datasets
import cs236781.plot as plot
image_shape = (3, 32, 64)
num_classes = 3
low, high = 0, 10
# Generate some random images and check values
X_ = None
for i in range(100):
    X, y = hw1datasets.random_labelled_image(image_shape, num_classes, low, high)
    test.assertEqual(X.shape, image_shape)
    test.assertIsInstance(y, int)
    test.assertTrue(0 <= y < num_classes)
    test.assertTrue(torch.all((X >= low) & (X < high)))
    if X_ is not None:
        test.assertFalse(torch.all(X == X_))
    X_ = X
plot.tensors_as_images([X, X_]);
In many cases we'll need to consistently get repeatable results even though we're using pseudo-random number generators (PRNGs). The way to do this is to provide a seed to the generator. Given the same seed, a PRNG will always generate the same sequence of numbers.
Here, we need a way to generate the same random image when accessing our dataset at the same index (e.g. to simulate a real set of images).
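One possible building block for this (a hedged sketch, not necessarily how torch_temporary_seed must be implemented) is snapshotting and restoring the global PRNG state, so that a temporary re-seed leaves the original random sequence untouched:

```python
import torch

torch.random.manual_seed(42)
expected = torch.randn(3)                 # the draw we want to reproduce

torch.random.manual_seed(42)
state = torch.random.get_rng_state()      # snapshot the global PRNG state
torch.random.manual_seed(7)               # temporarily use a different seed
_ = torch.randn(3)
torch.random.set_rng_state(state)         # restore: as if nothing happened
restored = torch.randn(3)                 # continues the original sequence

match = torch.equal(expected, restored)
```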
TODO Implement the torch_temporary_seed function in the hw1/datasets.py module.
Use the code below to test your implementation.
seeds = [42, 24]
torch.random.manual_seed(seeds[0])
# Before the context, the first seed affects the output
data_pre_context = torch.randn(100,)
with hw1datasets.torch_temporary_seed(seeds[1]):
    # Within this context, the second seed is in effect
    data_in_context = torch.randn(100,)
# After the context, the random state should be restored
data_post_context = torch.randn(100,)
data_around_context = torch.cat([data_pre_context, data_post_context])
# Use first seed, generate data in the same way but without changing context in the middle
torch.random.manual_seed(seeds[0])
data_no_context = torch.cat([torch.randn(100,), torch.randn(100,)])
# Identical results show that the context didn't affect external random state
test.assertTrue(torch.allclose(data_no_context, data_around_context))
# The data generated in the context should match what we would generate with the second seed
torch.random.manual_seed(seeds[1])
test.assertTrue(torch.allclose(data_in_context, torch.randn(100,)))
Now we can implement the dataset as required.
TODO Implement the RandomImageDataset class in the hw1/datasets.py module.
Use the code below to test your implementation.
# Test RandomImageDataset
# Create the dataset
num_samples = 500
num_classes = 10
image_size = (3, 32, 32)
ds = hw1datasets.RandomImageDataset(num_samples, num_classes, *image_size)
# You can load individual items from the dataset by indexing
img0, cls0 = ds[139]
# Plot first N images from the dataset with a helper function
fig, axes = plot.dataset_first_n(ds, 9, show_classes=True, nrows=3)
# The same image should be returned every time the same index is accessed
for i in range(num_samples):
    X, y = ds[i]
    X_, y_ = ds[i]
    test.assertEqual(X.shape, image_size)
    test.assertIsInstance(y, int)
    test.assertEqual(y, y_)
    test.assertTrue(torch.all(X == X_))
# Should raise if out of range
for i in range(num_samples, num_samples+10):
    with test.assertRaises(ValueError):
        ds[i]
This simple dataset is a useful abstraction when we know in advance the number of samples in our dataset and can access them by indexing. However, in many cases we simply cannot know about all data in advance. For example, perhaps new data is generated in real time.
To deal with these cases, we can use a different type of abstraction: an IterableDataset which provides an interface only to iterate over samples, but not to index them directly.
Let's implement such a dataset which will allow us to iterate over an infinite stream of randomly-generated images.
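The IterableDataset interface can be sketched with a generator-based `__iter__`; the class below is illustrative only, not the hw1 ImageStreamDataset:

```python
import torch
from torch.utils.data import IterableDataset

class RandomStream(IterableDataset):
    """Yields an endless stream of (image, label) pairs (illustrative only)."""
    def __init__(self, num_classes, *image_shape):
        self.num_classes = num_classes
        self.image_shape = image_shape

    def __iter__(self):
        while True:  # no length, no indexing: iteration only
            X = torch.rand(*self.image_shape)
            y = torch.randint(self.num_classes, (1,)).item()
            yield X, y

stream = RandomStream(10, 3, 8, 8)
first = next(iter(stream))
```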
ds = hw1datasets.ImageStreamDataset(num_classes, *image_size)
# This dataset can't be indexed
with test.assertRaises(NotImplementedError):
    ds[0]
# There is no length
with test.assertRaises(TypeError):
    len(ds)
# Arbitrarily stop somewhere
stop = torch.randint(2**11, 2**16, (1,)).item()
# We can iterate over it, indefinitely
for i, (X, y) in enumerate(ds):
    test.assertEqual(X.shape, image_size)
    test.assertIsInstance(y, int)
    if i > stop:
        break
print(f'Generated {i} images')
test.assertGreater(i, stop)
Generated 22112 images
Now that we've created a simple Dataset to see how they work, we'll load one of pytorch's built-in datasets: CIFAR-10. This is a famous dataset consisting of 60,000 small 32x32 color images classified into 10 classes. You can read more about it here.
The torchvision package has built-in Dataset classes that can download the data to a local folder,
load it, transform it using arbitrary transform functions and iterate over the resulting samples.
Run the following code block to download CIFAR-10, create a Dataset from it, and plot some random images. The data won't be downloaded again if already present.
import os
import torchvision
import torchvision.transforms as tvtf
cfar10_labels = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
data_root = os.path.expanduser('~/.pytorch-datasets')
cifar10_train_ds = torchvision.datasets.CIFAR10(
root=data_root, download=True, train=True,
transform=tvtf.ToTensor()
)
print('Number of samples:', len(cifar10_train_ds))
# Plot them with a helper function
fig, axes = plot.dataset_first_n(cifar10_train_ds, 64,
show_classes=True, class_labels=cfar10_labels,
nrows=8, hspace=0.5)
Files already downloaded and verified Number of samples: 50000
Now that we've loaded the entire CIFAR-10 dataset, we would like to work with a smaller subset
from it to reduce runtime of the code in this notebook.
A simple way to achieve this with Datasets is to wrap a Dataset in another Dataset that does this for us. This will make it easy to use our subset with DataLoaders as you will see later.
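The wrapping idea, in sketch form (illustrative names only; the real SubsetDataset is specified in hw1/datasets.py), is to delegate indexing to the wrapped dataset with an offset:

```python
from torch.utils.data import Dataset

class SubsetSketch(Dataset):
    """Exposes source[offset : offset+subset_len] as a dataset of its own."""
    def __init__(self, source, subset_len, offset=0):
        self.source, self.subset_len, self.offset = source, subset_len, offset

    def __getitem__(self, index):
        if not 0 <= index < self.subset_len:
            raise IndexError(index)
        return self.source[self.offset + index]  # delegate, shifted by offset

    def __len__(self):
        return self.subset_len

base = list(range(100))  # any indexable source works
sub = SubsetSketch(base, subset_len=5, offset=10)
```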
TODO Complete the implementation of SubsetDataset in hw1/datasets.py and use the following code block to test.
subset_len = 5000
subset_offset = 1234
cifar10_train_subset_ds = hw1datasets.SubsetDataset(cifar10_train_ds, subset_len, subset_offset)
dataset_x, dataset_y = cifar10_train_ds[subset_offset + 10]
subset_x, subset_y = cifar10_train_subset_ds[10]
# Tests
test.assertEqual(len(cifar10_train_subset_ds), subset_len)
test.assertTrue(torch.all(dataset_x == subset_x))
test.assertEqual(dataset_y, subset_y)
with test.assertRaises(IndexError, msg="Out of bounds index should raise IndexError"):
tmp = cifar10_train_subset_ds[subset_len]
Notice that when we initialized the Dataset instance for CIFAR-10, we provided a transform parameter.
This is a way to specify an arbitrary transformation that should be run on each sample prior to returning it from the dataset.
In the above, we used the ToTensor() transformation from torchvision.transforms to convert the
images from a PIL (Python Imaging Library) image object which has a shape of 32x32x3 and values in range [0, 255] into a pytorch Tensor of shape 3x32x32 and values in range [0, 1].
To demonstrate the use of transforms, we'll implement two custom transforms which invert the colors and flip the images around the horizontal axis.
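In torchvision, a transform is simply a callable object. The two requested transforms could be sketched roughly as follows (hedged: the real classes belong in hw1/transforms.py and may be structured differently):

```python
import torch

class InvertSketch:
    """Maps each pixel value v to 1 - v, for tensors in [0, 1] (illustrative)."""
    def __call__(self, x):
        return 1.0 - x

class FlipUpDownSketch:
    """Flips a (C, H, W) tensor along its height dimension (illustrative)."""
    def __call__(self, x):
        return torch.flip(x, dims=[1])

x = torch.zeros(1, 2, 2)
x[0, 0, 0] = 1.0
y = FlipUpDownSketch()(InvertSketch()(x))  # invert, then flip rows
```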
TODO Complete the InvertColors and FlipUpDown classes in the hw1/transforms.py module.
import hw1.transforms as hw1transforms
cifar10_inverted_ds = torchvision.datasets.CIFAR10(
root=data_root, download=True, train=True,
transform=tvtf.Compose([ # Compose allows us to chain multiple transforms in a sequence
tvtf.ToTensor(), # Convert PIL image to pytorch Tensor (C,H,W) in range [0,1]
hw1transforms.InvertColors(),
hw1transforms.FlipUpDown(),
])
)
fig, axes = plot.dataset_first_n(cifar10_inverted_ds, 64,
show_classes=True, class_labels=cfar10_labels,
nrows=8, hspace=0.5)
test.assertTrue(torch.allclose(cifar10_train_ds[0][0], torch.flip(1.-cifar10_inverted_ds[0][0], [1])),
"Wrong custom transform")
Files already downloaded and verified
DataLoaders and Samplers
We have seen that a Dataset is simply an iterable that allows us to iterate over samples, and possibly also to access them by index.
Simple to implement, but not very powerful.
The real benefit is when combining them with DataLoader.
A DataLoader samples a batch of samples from the dataset according to logic defined by a Sampler object.
The sampler decides how to partition the dataset into batches of N samples.
The DataLoader additionally handles loading samples in parallel to speed up creation of a batch.
A major motivation here is memory usage. When combining a DataLoader with a Dataset we can easily
control memory constraints by simply setting the batch size.
This is important since large datasets (e.g. ImageNet) do not fit in memory of most machines.
Since a Dataset can lazily load samples from disk on access,
and the DataLoader can sample random samples from it in parallel, we are provided with a simple
yet high-performance mechanism to iterate over random batches from our dataset without needing to
hold all of it in memory.
Let's create a basic DataLoader for our CIFAR-10 dataset.
Run the following code block multiple times and observe that different samples are shown each time in the first few batches.
# Create a simple DataLoader that partitions the data into batches
# of size N=8 in random order, using two background processes
cifar10_train_dl = torch.utils.data.DataLoader(
cifar10_train_ds, batch_size=8, shuffle=True, num_workers=2
)
# Iterate over batches sampled with our DataLoader
num_batches_to_show = 5
for idx, (images, classes) in enumerate(cifar10_train_dl):
    # The DataLoader returns a tuple of:
    # images: Tensor of size NxCxWxH
    # classes: Tensor of size N
    fig, axes = plot.tensors_as_images(images, figsize=(8, 1))
    fig.suptitle(f'Batch #{idx+1}:', x=0, y=0.6)
    if idx >= num_batches_to_show - 1:
        break
Here, we specified shuffle=True to the DataLoader. This automatically created a Sampler which simply returns indices from the Dataset in a random order.
To better control the content of the batches, we can create our own custom sampler.
Imagine we want each batch to contain one sample from the beginning of the dataset and
another from the end. If we have N samples, we would like to get the following sequence of indices: [0, N-1, 1, N-2, 2, N-3, ...] and then use a batch_size of 2.
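The required index order can be sketched with a two-pointer generator (the function name here is illustrative; the actual class must implement PyTorch's Sampler interface):

```python
def first_last_indices(n):
    """Yield 0, n-1, 1, n-2, ... covering each of the n indices exactly once."""
    lo, hi = 0, n - 1
    while lo < hi:
        yield lo
        yield hi
        lo, hi = lo + 1, hi - 1
    if lo == hi:  # odd n: the middle element comes last
        yield lo

order5 = list(first_last_indices(5))  # [0, 4, 1, 3, 2]
order6 = list(first_last_indices(6))  # [0, 5, 1, 4, 2, 3]
```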
TODO Implement the FirstLastSampler class in the hw1/dataloaders.py module.
import hw1.dataloaders as hw1dataloaders
# Test sampler with odd number of elements
sampler = hw1dataloaders.FirstLastSampler(list(range(5)))
test.assertEqual(list(sampler), [0,4, 1,3, 2,])
# Test sampler with even number of elements
sampler = hw1dataloaders.FirstLastSampler(list(range(6)))
test.assertEqual(list(sampler), [0,5, 1,4, 2,3])
# Create a DataLoader that partitions the data into batches
# of size N=2 in an order determined by our custom sampler
cifar10_train_dl = torch.utils.data.DataLoader(
cifar10_train_ds, batch_size=2, num_workers=0,
sampler=hw1dataloaders.FirstLastSampler(cifar10_train_ds),
)
# Iterate over batches sampled with our DataLoader
num_batches_to_show = 3
for idx, (images, classes) in enumerate(cifar10_train_dl):
    fig, axes = plot.tensors_as_images(images, figsize=(8, 1))
    fig.suptitle(f'Batch #{idx+1}:', x=0, y=0.6)
    if idx >= num_batches_to_show - 1:
        break
Now that we know about DataLoaders we can use them to do something useful: split a training dataset into Training and Validation sets.
A common issue in machine learning models is the abundance of hyperparameters that must be selected prior to training the model on data. These hyperparameters may be part of the model itself or part of the training process. We would like to determine which hyperparameter selection best fits the training data and, more importantly, generalizes to unseen data.
A prevalent approach is therefore to split the training dataset into two parts: one for actual training, i.e. tuning model parameters (e.g. weights in the case of neural nets), and another for validation, i.e. comparing one model or set of hyperparameters to another. After the best model is selected (by seeking the minimal validation error), it can be retrained on the entire training set.
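One common way to build such a split (a sketch only; the required create_train_validation_loaders signature is defined in hw1/dataloaders.py) uses two SubsetRandomSamplers over disjoint shuffled index sets:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Hypothetical toy dataset standing in for CIFAR-10
ds = TensorDataset(torch.randn(100, 4), torch.zeros(100))
validation_ratio = 0.2
n_valid = int(len(ds) * validation_ratio)

indices = torch.randperm(len(ds)).tolist()  # shuffle once, then split
dl_train = DataLoader(ds, batch_size=10,
                      sampler=SubsetRandomSampler(indices[n_valid:]))
dl_valid = DataLoader(ds, batch_size=10,
                      sampler=SubsetRandomSampler(indices[:n_valid]))
```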

TODO Implement the function create_train_validation_loaders in the hw1/dataloaders.py module.
Use the following code block to check your implementation.
# Testing the train/validation split dataloaders
import hw1.dataloaders as hw1dataloaders
validation_ratio = 0.2
dl_train, dl_valid = hw1dataloaders.create_train_validation_loaders(cifar10_train_ds, validation_ratio)
train_idx = set(dl_train.sampler.indices)
valid_idx = set(dl_valid.sampler.indices)
train_size = len(train_idx)
valid_size = len(valid_idx)
print('Training set size: ', train_size)
print('Validation set size: ', valid_size)
# Tests
test.assertEqual(train_size+valid_size, len(cifar10_train_ds), "Incorrect total number of samples")
test.assertEqual(valid_size, validation_ratio * (train_size + valid_size), "Incorrect ratio")
test.assertTrue(train_idx.isdisjoint(valid_idx), "Train and validation sets are not disjoint")
Training set size: 40000 Validation set size: 10000
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
Determine whether each of the following statements is true or false, and explain why in detail:
display_answer(hw1.answers.part1_q1)
Your answer:
The in-sample error is the error rate we get on the same data set we used to build our predictor. The test set is a subset of the data that was not used to build the model, hence it estimates the out-of-sample error.
For example, given data with two labels (dogs and cats), we could split the data into two disjoint subsets where each
subset contains samples of a single label (i.e. train set = all samples labeled cats, test set = all samples labeled dogs).
As each sample has a unique label, these are disjoint subsets. Such a model would probably label everything as cats, unlike
a model trained on subsets whose label distribution matches the distribution of the full data set.
This question is ambiguous as the definitions of "test set" and "using" vary between sources. To answer clearly, we'll first present our terminology: in CV we first allocate a subset of the dataset to be the test set. We then use cross-validation on the rest of the dataset (excluding the test set); at this stage we name the new subsets the "training set" and "validation set". So, in the training process of the CV routine, we repeatedly use different training and validation sets, but not the test set. Finally, after the training process, we use the test set to estimate our model's out-of-sample error.
Following the logic and terminology presented above, the test set is used to estimate our generalization error.
Write your answer using markdown and $\LaTeX$:
# A code block
a = 2
An equation: $e^{i\pi} -1 = 0$
Your friend has trained a simple linear regression model, e.g. $\hat{y}=\vec{w}\vec{x}+b$, with some training data. He then evaluated it on a disjoint test set and concluded that the model has over-fit the training set, and therefore decided to add a regularization term $\lambda \norm{\vec{w}}^2$ to the loss, where $\lambda$ is a hyperparameter. In order to select the value of $\lambda$, your friend re-trained the model on his training set with different values of $\lambda$ and then chose the value which produced the best results on the test set.
Is your friend's approach justified? Explain why or why not.
display_answer(hw1.answers.part1_q2)
Your answer: His approach of adding an L2 regularization term to the loss function is justified, as it reduces the chance of overfitting due to over-inflated weights. However, as a hyperparameter, the regularization strength $\lambda$ should not be determined by the test set, but rather selected during the training process based on the training and validation sets.
In this part, we'll familiarize ourselves with the PyTorch tensor API by implementing a very simple classifier,
kNN, using efficient, vectorized tensor operations alone.
We'll then implement cross-validation, an important ML technique used to find suitable
values for a model's hyperparameters.
import os
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(1904)
test = unittest.TestCase()
Arguably the most basic classification scheme in a supervised learning setting is the
k nearest-neighbor (kNN) classifier.
Given a training data set, kNN's "training" phase consists of simply memorizing it.
When a classification of an unseen sample is required, some distance metric (e.g. Euclidean)
is computed between it and all training samples.
The unseen sample is then classified according to the majority label of its k nearest neighbors.
Here we'll implement the most basic kNN, working directly on image pixel values and computing L2 distance between a test image and every known training image. We'll use data from the MNIST database of handwritten digits. This database contains single-channel images with a constant black background and the digits are roughly the same size, which makes it feasible to obtain bearable classification accuracy even with such a naïve model.
Note however that real-world kNN models are often implemented with tree-based data structures to find nearest neighbors in logarithmic time, with specialized distance functions, and using image features instead of raw pixels.
TODO Implement the TensorView transform in the hw1/transforms module, and run the following code to
load the data we'll work with.
# Prepare data for kNN Classifier
import torchvision.transforms as tvtf
import cs236781.dataloader_utils as dataloader_utils
import hw1.datasets as hw1datasets
import hw1.transforms as hw1tf
# Define the transforms that should be applied to each MNIST image before returning it
tf_ds = tvtf.Compose([
tvtf.ToTensor(), # Convert PIL image to pytorch Tensor
hw1tf.TensorView(-1), # Reshape to 1D Tensor
])
# Define how much data to load (only use a subset for speed)
num_train = 10000
num_test = 1000
batch_size = 1024
# Training dataset & loader
data_root = os.path.expanduser('~/.pytorch-datasets')
ds_train = hw1datasets.SubsetDataset(
torchvision.datasets.MNIST(root=data_root, download=True, train=True, transform=tf_ds), num_train)
dl_train = torch.utils.data.DataLoader(ds_train, batch_size)
# Test dataset & loader
ds_test = hw1datasets.SubsetDataset(
torchvision.datasets.MNIST(root=data_root, download=True, train=False, transform=tf_ds), num_test)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size)
# Get all test data
x_test, y_test = dataloader_utils.flatten(dl_test)
TODO Implement the l2_dist function in the hw1/knn_classifier.py module. This is the core of the kNN algorithm. You'll need to use broadcasting to implement it in an efficient, vectorized way (without loops).
import itertools as it
import hw1.knn_classifier as hw1knn
def l2_dist_naive(x1, x2):
    """
    Naive distance calculation, just for testing.
    Super slow, don't use!
    """
    dists = torch.empty(x1.shape[0], x2.shape[0], dtype=torch.float)
    for i, j in it.product(range(x1.shape[0]), range(x2.shape[0])):
        dists[i, j] = torch.sum((x1[i] - x2[j])**2).item()
    return torch.sqrt(dists)
# Test distance calculation
x1 = torch.randn(12, 34)
x2 = torch.randn(45, 34)
dists = hw1knn.l2_dist(x1, x2)
dists_naive = l2_dist_naive(x1, x2)
test.assertTrue(torch.allclose(dists, dists_naive), msg="Wrong distances")
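One standard way to vectorize this computation (a sketch; the function name l2_dist_sketch is illustrative, not the hw1 API) uses the expansion $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a^\top b$ together with broadcasting:

```python
import torch

def l2_dist_sketch(x1, x2):
    """All pairwise L2 distances between rows of x1 and x2, no loops (illustrative)."""
    # (N1, 1) + (1, N2) - 2*(N1, N2) broadcasts to an (N1, N2) matrix
    sq1 = (x1 ** 2).sum(dim=1, keepdim=True)   # (N1, 1): squared norms of x1 rows
    sq2 = (x2 ** 2).sum(dim=1).unsqueeze(0)    # (1, N2): squared norms of x2 rows
    sq_dists = sq1 + sq2 - 2.0 * (x1 @ x2.T)
    # clamp guards against tiny negative values from floating-point error
    return torch.sqrt(torch.clamp(sq_dists, min=0.0))

x1, x2 = torch.randn(12, 34), torch.randn(45, 34)
d = l2_dist_sketch(x1, x2)
```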
TODO Implement the accuracy function in the hw1/knn_classifier.py module.
This will be our score. It will simply return the fraction of predictions that are correct.
y1 = torch.tensor([0, 1, 2, 3])
y2 = torch.tensor([2, 2, 2, 2])
test.assertEqual(hw1knn.accuracy(y1, y2), 0.25)
TODO Complete the implementation of the KNNClassifier class in the module hw1/knn_classifier.py:
- the train() method.
- the predict() method.

Use the following code to test your implementations.
# Test kNN Classifier
knn_classifier = hw1knn.KNNClassifier(k=10)
knn_classifier.train(dl_train)
y_pred = knn_classifier.predict(x_test)
# Calculate accuracy
accuracy = hw1knn.accuracy(y_test, y_pred)
print(f'Accuracy: {accuracy*100:.2f}%')
# Sanity check: at least 80% accuracy
test.assertGreater(accuracy, 0.8)
Accuracy: 91.30%
A common way to choose hyperparameters for a model or even the model itself is by applying
K-fold cross-validation (CV).
For each candidate set of hyperparameters, the model is trained K times, each time with a different split of the training data into train and validation sets (each split is called a fold). The set of hyperparameters which resulted in the lowest average validation error rate is selected.
More specifically, K-fold CV is usually performed as follows:
- Split the training set into K non-overlapping parts.
- For each fold k=0,...,K-1: use the k-th part as the validation set and the remaining K-1 parts as the training set, then train and evaluate the model.

Now we would like to find the best value of k for applying our kNN model to MNIST.
In this case we already fixed the model and there is only one hyperparameter, the value of k
(not to be confused with K, the number of folds for the cross validation).
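The fold bookkeeping alone can be sketched in plain Python (illustrative; the actual find_best_k must also train and evaluate a classifier on each fold):

```python
def kfold_indices(n, num_folds):
    """Split range(n) into num_folds contiguous folds; yield (train, valid) index lists."""
    fold_size = n // num_folds
    for k in range(num_folds):
        valid = list(range(k * fold_size, (k + 1) * fold_size))
        # everything outside the k-th part becomes the training set
        train = [i for i in range(n) if i not in set(valid)]
        yield train, valid

folds = list(kfold_indices(8, 4))
```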
TODO Complete the implementation of the find_best_k function in the knn_classifier.py module.
num_folds = 4
k_choices = [1, 3, 5, 8, 12, 20, 50]
# Run cross-validation
best_k, accuracies = hw1knn.find_best_k(ds_train, k_choices, num_folds)
# Plot accuracies per k
_, ax = plt.subplots(figsize=(12,6), subplot_kw=dict(xticks=k_choices))
for i, k in enumerate(k_choices):
curr_accuracies = accuracies[i]
ax.scatter([k] * len(curr_accuracies), curr_accuracies)
accuracies_mean = np.array([np.mean(accs) for accs in accuracies])
accuracies_std = np.array([np.std(accs) for accs in accuracies])
ax.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
ax.set_title(f'{num_folds}-fold Cross-validation on k')
ax.set_xlabel('k')
ax.set_ylabel('Accuracy')
print('best_k =', best_k)
best_k = 1
Now that we found our best_k, we can train the model with that value of k on the full training set and evaluate the accuracy on the test set:
knn_classifier = hw1knn.KNNClassifier(k=1)
knn_classifier.train(dl_train)
y_pred = knn_classifier.predict(x_test)
# Calculate accuracy
accuracy_best_k = hw1knn.accuracy(y_test, y_pred)
print(f'Accuracy: {accuracy_best_k*100:.2f}%')
test.assertGreater(accuracy_best_k, accuracy)
Accuracy: 92.00%
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
Does increasing k lead to improved generalization for unseen data? Why or why not? Up to what point? Think about the extremal values of k.
display_answer(hw1.answers.part2_q1)
Your answer:
In our model, increasing the value of k does not improve generalization for unseen data.
In our experiments, k=1 mostly had the highest accuracy, followed by k=3, which sometimes even slightly surpassed k=1.
We assume that there is high variability in the sample representations, such that for a test sample
it is likely to find a training sample that is very close (low L2 distance), so with k=1 we get a more accurate decision.
But the larger the k we use, the more false labels enter our top-k (since the L2 distance between wrong labels with a similar
representation is close to that of a right label with a "far" representation), which leads to lower accuracy.
Explain why (i.e. in what sense) using k-fold CV, as detailed above, is better than:
display_answer(hw1.answers.part2_q2)
Your answer:
train-set accuracy, in the sense of measuring the model's ability to generalize its performance to unseen data. In a more "extreme" manner, with the CV method we decrease the chances of overfitting compared to the suggested method.
to the test set. In a sense, this means our model will underfit the training set. Usually this will lead to a "bad" model, as the training set is usually bigger and "better distributed" than the test set.
$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial {#1}}{\partial {#2}}} $
In this part we'll learn about loss functions and how to optimize them with gradient descent. We'll then use this knowledge to train a very simple model: a linear SVM.
import os
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(1904)
test = unittest.TestCase()
In multi-class linear classification we have $C$ classes which we assume our samples may belong to. We apply a linear function to a sample $\vec{x} \in \set{R}^{D}$ and obtain a score $s_j$ which represents how well $x$ fits the class $1\leq j\leq C$ according to our model: $$ s_j = \vectr{w_j} \vec{x} + b_j. $$
Note that we have a different set of model parameters (weights) $\vec{w_j},~b_j$ for each class, so a total of $C\cdot(D+1)$ parameters.
To classify a sample, we simply calculate the score for each class and choose the class with the highest score as our prediction.
One interpretation of the weights $\vec{w_j},~b_j$ is that they represent the parameters of a hyperplane in the $D$-dimensional feature space. Under this interpretation, the class score $s_j$ of a sample is proportional to the signed distance of that sample from the hyperplane representing the $j$-th class. Note that this score can be positive or negative (depending on which side of the hyperplane the sample is on). Such a classifier therefore splits the sample space into regions: the farther a sample is on the positive side of the hyperplane for class $j$, the higher $s_j$ is, and the more likely the sample belongs to class $j$.
In the context of supervised learning of a linear classifier model, we map a dataset (or batch from a dataset) of $N$ samples (for example, images flattened to vectors of length $D$) to a score for one of each of $C$ possible classes using the linear function above.
To make the implementation efficient, we'll represent the mapping with a single matrix multiplication, employing the "Bias trick": Instead of both $\vec{w_j}$ and $b_j$ per class, we'll put the bias term at the beginning of the weight vector and add a term $1$ at the start of each sample.
The class scores for each sample are then given by:
$$ \mat{S} = \mat{X} \mat{W} $$
Where here (and in the code examples you'll work with), $\mat{X}$ is the $N \times (D+1)$ matrix of samples (after applying the bias trick), $\mat{W}$ is the $(D+1) \times C$ weight matrix whose $j$-th column contains $b_j$ followed by $\vec{w_j}$, and $\mat{S}$ is the resulting $N \times C$ matrix of class scores.
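To make the dimensions concrete, here is a hedged toy example of the bias trick (the tensor names are hypothetical, not from hw1):

```python
import torch

# Hypothetical toy sizes: N=2 samples, D=3 features, C=2 classes
X = torch.randn(2, 3)
ones = torch.ones(X.shape[0], 1)
Xb = torch.cat([ones, X], dim=1)   # bias trick: prepend a constant 1 feature

W = torch.randn(4, 2)              # (D+1) x C; the first row plays the role of b_j
S = Xb @ W                         # (N, D+1) @ (D+1, C) -> (N, C) class scores
y_pred = S.argmax(dim=1)           # predicted class per sample
```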
TODO Implement the BiasTrick transform class in the module hw1/transforms.py.
import hw1.transforms as hw1tf
tf_btrick = hw1tf.BiasTrick()
test_cases = [
torch.randn(64, 512),
torch.randn(2, 3, 4, 5, 6, 7),
torch.randint(low=0, high=10, size=(1, 12)),
torch.tensor([10, 11, 12])
]
for x_test in test_cases:
xb = tf_btrick(x_test)
print('shape =', xb.shape)
test.assertEqual(x_test.dtype, xb.dtype, "Wrong dtype")
test.assertTrue(torch.all(xb[..., 1:] == x_test), "Original features destroyed")
test.assertTrue(torch.all(xb[..., [0]] == torch.ones(*xb.shape[:-1], 1)), "First feature is not equal to 1")
shape = torch.Size([64, 513]) shape = torch.Size([2, 3, 4, 5, 6, 8]) shape = torch.Size([1, 13]) shape = torch.Size([4])
import torchvision.transforms as tvtf
# Define the transforms that should be applied to each image in the dataset before returning it
tf_ds = tvtf.Compose([
# Convert PIL image to pytorch Tensor
tvtf.ToTensor(),
# Normalize each channel with precomputed mean and std of the train set
tvtf.Normalize(mean=(0.1307,), std=(0.3081,)),
# Reshape to 1D Tensor
hw1tf.TensorView(-1),
# Apply the bias trick (add bias element to features)
hw1tf.BiasTrick(),
])
The following code will use your transform to load a subset of the MNIST dataset for us to work with.
import hw1.datasets as hw1datasets
import hw1.dataloaders as hw1dataloaders
# Define how much data to load
num_train = 10000
num_test = 1000
batch_size = 1000
# Training dataset
data_root = os.path.expanduser('~/.pytorch-datasets')
ds_train = hw1datasets.SubsetDataset(
torchvision.datasets.MNIST(root=data_root, download=True, train=True, transform=tf_ds),
num_train)
# Create training & validation sets
dl_train, dl_valid = hw1dataloaders.create_train_validation_loaders(
ds_train, validation_ratio=0.2, batch_size=batch_size
)
# Test dataset & loader
ds_test = hw1datasets.SubsetDataset(
torchvision.datasets.MNIST(root=data_root, download=True, train=False, transform=tf_ds),
num_test)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size)
x0, y0 = ds_train[0]
n_features = torch.numel(x0)
n_classes = 10
# Make sure samples have bias term added
test.assertEqual(n_features, 28*28*1+1, "Incorrect sample dimension")
TODO Complete the implementation of the __init__(), predict() and evaluate_accuracy() functions in the
LinearClassifier class located in the hw1/linear_classifier.py module.
import hw1.linear_classifier as hw1linear
# Create a classifier
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
# Evaluate accuracy on test set
mean_acc = 0
for (x, y) in dl_test:
    y_pred, _ = lin_cls.predict(x)
    mean_acc += lin_cls.evaluate_accuracy(y, y_pred)
mean_acc /= len(dl_test)
print(f"Accuracy: {mean_acc:.1f}%")
Accuracy: 5.9%
You should get an accuracy of around 10%, corresponding to a random guess of one of ten classes. You can run the above code block multiple times to sample different initial weights and get slightly different results.
We have seen that a linear model computes the class scores for each sample using a linear mapping as a score function. However, in order to train the model, we need to define some measure of how well we've classified our samples compared to their ground-truth labels. This measure is known as a loss function, and its selection is crucial in determining the model that will result from training. A loss function produces lower values the better the classification is.
A very common linear model for classification is the Support Vector Machine. An SVM attempts to find separating hyperplanes that have the property of creating a maximal margin to the training samples, i.e. hyperplanes that are as far as possible from the closest training samples. For example, in the following image we see a simple case with two classes of samples that have only two features. The data is linearly separable, and it's easy to see there are infinitely many possible hyperplanes (in this case lines) that separate the data perfectly.
The SVM model finds the optimal hyperplane, which is the one with the maximal margin. The data points closest to the separating hyperplane are called the Support Vectors (it can be shown that only they determine the hyperplane). We can see that the width of the margin is $\frac{2}{\norm{\vec{w}}}$. In this simple case since the data is linearly separable, there exists a solution where no samples fall within the margin. If the data is not linearly separable, we need to allow samples to enter the margin (with a cost). This is known as a soft-margin SVM.
There are many ways to train an SVM model. Classically, the problem is stated as constrained optimization and solved with quadratic optimization techniques. In this exercise, we'll instead work directly with the unconstrained SVM loss function, calculate its gradient analytically, and then minimize it with gradient descent. As we'll see in the rest of the course, this technique will be a major component when we train deep neural networks.
The in-sample (empirical) loss function for a multiclass soft-margin SVM can be stated as follows:
$$ L(\mat{W}) = \frac{1}{N} \sum_{i=1}^{N} L_{i}(\mat{W}) + \frac{\lambda}{2} \norm{\mat{W}}^2 $$Where the first term is the mean pointwise data-dependent loss $L_{i}$, given by the hinge loss formula,
$$ L_{i}(\mat{W}) = \sum_{j \neq y_i} \max\left(0, \Delta+ \vectr{w_j} \vec{x_i} - \vectr{w_{y_i}} \vec{x_i}\right), $$and the second term is a regularization loss which depends only on model parameters. Note that the hinge loss term sums over the wrong class prediction scores for each sample: $j\neq y_i$, and $y_i$ is the ground-truth label for sample $i$. This can be understood as attempting to make sure that the score for the correct class is higher than the other classes by at least some margin $\Delta > 0$, otherwise a loss is incurred. This way, we allow samples to fall within the margin but incur loss, which gives us a soft-margin SVM.
The regularization term penalizes large weight magnitudes to prevent ambiguous solutions, since if e.g. $\mat{W^*}$ is a weight matrix that perfectly separates the data, then so is $\alpha\mat{W^*}$ for any scalar $\alpha \geq 1$.
Fitting an SVM model then amounts to finding the weight matrix $\mat{W}$ which minimizes $L(\mat{W})$. Note that we're writing the loss as a function of $\mat{W}$ to emphasize that we wish to minimize its value on the given data with respect to the weights $\mat{W}$, even though it obviously depends also on our specific dataset, $\left\{ \vec{x_i}, y_i \right\}_{i=1}^{N}$.
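To make the loss formula concrete, here is an illustrative vectorized sketch. The helper name `svm_hinge_loss` and its signature are our own for demonstration; the required `loss()` interface below also receives `x` and `y_pred`.

```python
import torch

def svm_hinge_loss(x_scores: torch.Tensor, y: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Score of the ground-truth class for each sample, shape (N, 1)
    s_true = x_scores.gather(dim=1, index=y.view(-1, 1))
    # Margins m_ij = delta + s_j - s_{y_i}, clamped at zero, shape (N, C)
    margins = torch.clamp(x_scores - s_true + delta, min=0.0)
    # The j == y_i entries contribute exactly delta each; subtract them
    return margins.sum(dim=1).mean() - delta

# Toy example: 2 samples, 3 classes; both correct with margin >= delta, so loss is 0
scores = torch.tensor([[3.0, 1.0, 0.5], [1.0, 4.0, 2.5]])
labels = torch.tensor([0, 1])
loss = svm_hinge_loss(scores, labels)
```

Note how subtracting `delta` once per sample implements the $j \neq y_i$ restriction without a loop over classes.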
TODO Implement the SVM hinge loss function in the module hw1/losses.py, within the SVMHingeLoss class.
Implement just the loss() function. For now you can ignore the part about saving tensors for the gradient calculation. Run the following to test.
import cs236781.dataloader_utils as dl_utils
from hw1.losses import SVMHingeLoss
torch.random.manual_seed(42)
# Classify all samples in the test set
# because it doesn't depend on randomness of train/valid split
x, y = dl_utils.flatten(dl_test)
# Compute predictions
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
y_pred, x_scores = lin_cls.predict(x)
# Calculate loss with our hinge-loss implementation
loss_fn = SVMHingeLoss(delta=1.)
loss = loss_fn(x, y, x_scores, y_pred)
# Compare to pre-computed expected value as a test
expected_loss = 9.0233
print("loss =", loss.item())
print('diff =', abs(loss.item()-expected_loss))
test.assertAlmostEqual(loss.item(), expected_loss, delta=1e-2)
loss = 9.023382186889648
diff = 8.218688964767296e-05
In this section we'll implement a simple gradient descent optimizer for the loss function we've implemented above. As you have seen in the lectures, the basic gradient-based optimization scheme is to repeatedly update the weights by stepping against the gradient of the loss: $\mat{W} \leftarrow \mat{W} - \eta \nabla_{\mat{W}} L(\mat{W})$, where $\eta$ is the learning rate.
The crucial component here is the gradient calculation. In this exercise we'll analytically derive the gradient of the loss and then implement it in code. In the next parts of the course we'll enjoy the automatic-differentiation features of PyTorch, but for now we'll do it the old-fashioned way.
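As a minimal illustration of the gradient descent update $\vec{w} \leftarrow \vec{w} - \eta\,\nabla L(\vec{w})$, here is a toy sketch with a hypothetical helper (not the required train() implementation):

```python
import torch

def gradient_descent(grad_fn, w_init: torch.Tensor, learn_rate: float, n_steps: int) -> torch.Tensor:
    # Basic scheme: repeatedly step against the gradient, w <- w - eta * grad(w)
    w = w_init.clone()
    for _ in range(n_steps):
        w = w - learn_rate * grad_fn(w)
    return w

# Toy check: minimizing L(w) = ||w||^2 / 2 (whose gradient is w) drives w to zero
w_final = gradient_descent(lambda w: w, torch.ones(3), learn_rate=0.1, n_steps=100)
```

Each step multiplies `w` by $(1 - \eta)$ here, so the iterates shrink geometrically toward the minimizer at zero.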
An important detail to note is that while $L(\mat{W})$ is scalar-valued, it's a function of all the elements of the matrix $\mat{W}$. Therefore its gradient w.r.t. $\mat{W}$ is also a matrix of the same shape as $\mat{W}$:
$$ \nabla_{\mat{W}} L = \begin{bmatrix} \frac{\partial L}{\partial W_{1,1}} & & \cdots & \frac{\partial L}{\partial W_{1,C}} \\ \frac{\partial L}{\partial W_{2,1}} & \ddots & \\ \vdots & & \ddots & \\ \frac{\partial L}{\partial W_{D+1,1}} & \cdots & & \frac{\partial L}{\partial W_{D+1,C}} \\ \end{bmatrix} = \begin{bmatrix} \vert & & \vert \\ \frac{\partial L}{\partial\vec{w_1}} & \cdots & \frac{\partial L}{\partial\vec{w_C}}\\ \vert & & \vert \\ \end{bmatrix} \in \set{R}^{(D+1)\times C}. $$For our gradient descent update-step we'll need to create such a matrix of derivatives and evaluate it at the current value of the weight matrix.
The first thing we need to do is formulate an expression for the gradient of the loss function defined above. Since the expression for the loss depends on the columns of $\mat{W}$, we'll derive an expression for the gradient of $L(\mat{W})$ w.r.t. each $\vec{w_j}$:
$$ \pderiv{L}{\vec{w_j}}(\mat{W}) = \frac{1}{N} \sum_{i=1}^{N} \pderiv{L_{i}}{\vec{w_j}}(\mat{W}) + \lambda \vec{w_j}. $$To compute the gradient of the pointwise loss, let's define the margin-loss of sample $i$ for class $j$ as follows: $m_{i,j} = \Delta + \vectr{w_j}\vec{x_i} - \vectr{w_{y_i}}\vec{x_i}$. We can then write the pointwise loss and its gradient in terms of $m_{i,j}$. We'll separate the case of $j=y_i$ (i.e. the gradient for the correct class):
$$ \begin{align} \pderiv{L_i}{\vec{w_j}} & = \begin{cases} \vec{x_i}, & m_{i,j}>0 \\ 0, & \mathrm{else} \\ \end{cases} ,~j \neq y_i \\ \\ \pderiv{L_i}{\vec{w_{y_i}}} & = -\vec{x_i} \sum_{j\neq y_i} \mathbb{1}\left( m_{i,j} > 0 \right) \end{align} $$Where $\mathbb{1}(\cdot)$ is an indicator function that takes the value $1$ if its argument is true, and $0$ otherwise.
Note: the hinge-loss function is not strictly speaking differentiable due to the $\max$ operator. However, in practice it's not a major concern. Given that we know what argument the $\max$ "chooses", we can differentiate each one of them separately. This is known as a sub-gradient. In the above, when $m_{i,j} \leq 0$ we know the gradient will simply be zero.
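One possible vectorized sketch of this derivation (our own helper, not the required grad() method) builds the indicator matrix $G$ with $G_{i,j} = \mathbb{1}(m_{i,j} > 0)$ for $j \neq y_i$ and $G_{i,y_i} = -\sum_{j \neq y_i}\mathbb{1}(m_{i,j}>0)$, so that the gradient is $\frac{1}{N}\mat{X}^\top G$:

```python
import torch

def svm_hinge_loss_grad(x: torch.Tensor, y: torch.Tensor, x_scores: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    N = x.shape[0]
    s_true = x_scores.gather(dim=1, index=y.view(-1, 1))
    margins = x_scores - s_true + delta              # m_ij, shape (N, C)
    G = (margins > 0).to(x.dtype)                    # indicator 1(m_ij > 0)
    # Correct-class column: minus the count of violated margins over j != y_i.
    # Since m_{i,y_i} = delta > 0, its own indicator is 1 and must be excluded.
    G[torch.arange(N), y] = -(G.sum(dim=1) - 1)
    return x.t().mm(G) / N                           # shape (D, C), matches W

# Single-sample check: x = (1, 2), W's columns give scores (1, 2, 0), true class 0
x1 = torch.tensor([[1.0, 2.0]])
W = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
grad = svm_hinge_loss_grad(x1, torch.tensor([0]), x1.mm(W))
```

For this single sample, class 1 violates the margin, so its column gets $+\vec{x}$, the true class gets $-\vec{x}$, and class 2 gets zero.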
TODO Based on the above, implement the gradient of the loss function in the module hw1/losses.py,
within the SVMHingeLoss class.
Implement the grad() function and complete what's missing in the loss() function.
Make sure you understand the above gradient derivation before attempting to implement it.
Note: you'll be implementing only the first term in the above equation for $\pderiv{L}{\vec{w_j}}(\mat{W})$. We'll add the regularization term later.
# Create a hinge-loss function
loss_fn = SVMHingeLoss(delta=1)
# Compute loss and gradient
loss = loss_fn(x, y, x_scores, y_pred)
grad = loss_fn.grad()
# Sanity check only (not correctness): compare the shape of the gradient
test.assertEqual(grad.shape, lin_cls.weights.shape)
But in the above we only checked the shape; how do we know whether we've implemented the gradient correctly?
One approach is to recall the formal definition of the derivative, i.e. $$ f'(x)=\lim_{h\to 0} \frac{f(x+h)-f(x)}{h}. $$ Another way to put this is that for a small enough $h$, $$ f(x+h)\approx f(x)+f'(x)\cdot h. $$
We can use this approach to implement a gradient check by applying very small perturbations of each weight (separately) and using the above formula to check the correctness of the gradient (up to some tolerance). This is called a numerical gradient check.
Here we'll use a different approach, just to get a taste of the concept of automatic differentiation, which we'll rely on heavily in the rest of the course.
In the simple linear model we worked with, the gradient was fairly straightforward to derive analytically and implement. However for complex models such as deep neural networks with many layers and non-linear operations between them this is not the case. Additionally, the gradient must be re-derived any time either the model architecture or the loss function changes. These things make it infeasible in practice to perform deep-learning research using this manual method of gradient derivation. Therefore, all deep-learning frameworks provide a mechanism of automatic differentiation, to prevent the user from needing to manually derive the gradients of loss functions.
PyTorch provides this functionality in a package named torch.autograd, which we will use extensively in the next exercises.
For now, here's an example showing that autograd can compute the gradient of the loss function you've implemented.
TODO Run the following code block. Try to understand how autograd is used and why. If the test fails, go back and fix your gradient calculation.
from hw1.losses import SVMHingeLoss
# Create a new classifier and loss function
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
loss_fn = SVMHingeLoss(delta=1)
# Specify we want the gradient to be saved for the weights tensor
# (just for our test)
lin_cls.weights.requires_grad = True
# Forward pass using the weights tensor, calculations will be tracked
y_pred, x_scores = lin_cls.predict(x)
# Compute loss of predictions and their analytic gradient
loss = loss_fn(x, y, x_scores, y_pred)
grad = loss_fn.grad()
# Compute gradient with autograd
autograd_grad = torch.autograd.grad(loss, lin_cls.weights)[0]
# Calculate the difference between analytic and autograd
diff = torch.norm(grad - autograd_grad).item()
print('loss =', loss.item())
print('grad =\n', grad)
print('autograd =\n', autograd_grad)
print('diff =', diff)
test.assertLess(diff, 1e-3, "Gradient diff was too large")
loss = 8.961071968078613
grad =
tensor([[ 0.1500, -0.2600, -0.1600, ..., 0.0100, 0.1100, 0.0600],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
...,
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255]])
autograd =
tensor([[ 0.1500, -0.2600, -0.1600, ..., 0.0100, 0.1100, 0.0600],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
...,
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255]])
diff = 2.107230648107361e-05
Generally, solving a machine-learning problem requires defining the following components: a model, a loss function, an optimizer, and the data to train on.
Now that we have implemented our loss function and its gradient, we can finally train our model.
Implementation notes:
PyTorch provides very effective mechanisms to implement all of these components in a decoupled manner.
Note that the regularization term is handled by the optimizer through its weight_decay parameter. The reason is that we prefer that the part of the loss which depends only on the model parameters be part of the optimizer, not the loss function (though both ways are possible). You'll see this pattern later on when you use PyTorch's optimizers in the torch.optim package.
TODO Complete the implementation of the LinearClassifier's train() method. Use mini-batch SGD for the weight update rule. Then set your hyperparameters in the hyperparams() function. You should play with the hyperparameters to get a feel for what they do to the loss and accuracy graphs.
hp = hw1linear.hyperparams()
print('hyperparams =', hp)
lin_cls = hw1linear.LinearClassifier(n_features, n_classes, weight_std=hp['weight_std'])
# Evaluate on the test set
x_test, y_test = dl_utils.flatten(dl_test)
y_test_pred, _ = lin_cls.predict(x_test)
test_acc_before = lin_cls.evaluate_accuracy(y_test, y_test_pred)
# Train the model
svm_loss_fn = SVMHingeLoss()
train_res, valid_res = lin_cls.train(dl_train, dl_valid, svm_loss_fn,
learn_rate=hp['learn_rate'], weight_decay=hp['weight_decay'],
max_epochs=30)
# Re-evaluate on the test set
y_test_pred, _ = lin_cls.predict(x_test)
test_acc_after = lin_cls.evaluate_accuracy(y_test, y_test_pred)
# Plot loss and accuracy
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
for i, loss_acc in enumerate(('loss', 'accuracy')):
axes[i].plot(getattr(train_res, loss_acc))
axes[i].plot(getattr(valid_res, loss_acc))
axes[i].set_title(loss_acc.capitalize(), fontweight='bold')
axes[i].set_xlabel('Epoch')
axes[i].legend(('train', 'valid'))
axes[i].grid(which='both', axis='y')
# Check test set accuracy
print(f'Test-set accuracy before training: {test_acc_before:.1f}%')
print(f'Test-set accuracy after training: {test_acc_after:.1f}%')
test.assertGreaterEqual(test_acc_after, 85.0)
hyperparams = {'weight_std': 0.001, 'learn_rate': 0.015, 'weight_decay': 0.03}
Training..............................
Test-set accuracy before training: 13.5%
Test-set accuracy after training: 87.8%
Even though this is a very naïve model, you should get at least 85% test set accuracy if you implemented training correctly. You can try to change the hyperparameters and see whether you get better results. Generally this should be done with cross-validation.
One way to understand what models learn is to try to visualize their learned parameters. There can be many ways to do this. Let's try a very simple one, which is to reshape them into images of the input size and see what they look like.
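As a hint of the idea, here is a minimal sketch, assuming the bias element is the last feature (consistent with the `x_test[:, :-1]` slicing used when plotting test samples further below); the helper here is illustrative, not the required method:

```python
import torch

def weights_as_images(weights: torch.Tensor, img_shape=(1, 28, 28)) -> torch.Tensor:
    # weights has shape (D+1, C); drop the bias row (assumed last), then
    # reshape each class's weight column into an image of the input shape.
    w = weights[:-1, :].t()                    # (C, D)
    return w.reshape(w.shape[0], *img_shape)   # (C, *img_shape)

W_demo = torch.randn(28 * 28 + 1, 10)
imgs = weights_as_images(W_demo)
```

Each resulting image can then be displayed like an input sample, showing the "template" the classifier learned for that class.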
TODO Implement the weights_as_images() function in the LinearClassifier class.
import cs236781.plot as plot
w_images = lin_cls.weights_as_images(img_shape=(1,28,28))
fig, axes = plot.tensors_as_images(list(w_images))
Additionally, we can better understand the model by plotting some samples and looking at wrong predictions. Run the following block to visualize some test-set examples and the model's predictions for them.
# Plot some images from the test set and their predictions
n_plot = 104
x_test, y_test = next(iter(dl_test))
x_test = x_test[0:n_plot]
y_test = y_test[0:n_plot]
y_test_pred, _ = lin_cls.predict(x_test)
x_test_img = torch.reshape(x_test[:, :-1], (n_plot, 1, 28, 28))
fig, axes = plot.tensors_as_images(list(x_test_img), titles=y_test_pred.numpy(),
nrows=8, hspace=0.5, figsize=(10,8), cmap='gray')
# Highlight the wrong predictions
wrong_pred = y_test_pred != y_test
wrong_pred_axes = axes.ravel()[wrong_pred.numpy().astype(bool)]
for ax in wrong_pred_axes:
ax.title.set_color('red')
ax.title.set_fontweight('bold')
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
Explain why the selection of $\Delta > 0$ is arbitrary for the SVM loss $L(\mat{W})$ as it is defined above (the full in-sample loss, with the regularization term).
display_answer(hw1.answers.part3_q1)
Your answer:
1: The selection of $\Delta > 0$ is arbitrary for the SVM loss function because $\Delta$, the hyper-parameter representing the required score margin between the true class and all other classes, is directly affected by the hyper-parameter $\lambda$. The "weight decay"/regularization parameter $\lambda$ controls the magnitude of the weights; since the scores (and therefore the score differences between classes) scale with the weight magnitudes, limiting this magnitude also limits the effective margin $\Delta$. Therefore, we can safely set $\Delta$ arbitrarily and have $\lambda$ compensate. The two hyper-parameters control the same tradeoff, and controlling the weight magnitude may prove more convenient in some cases.
Given the images in the visualization section above,
display_answer(hw1.answers.part3_q2)
Your answer:
2.1: When viewing the images we can roughly see the outlines of the digit characters. The classifier learns, for each class, a weight per pixel reflecting how often that pixel is bright for that digit in the training set, and ultimately scores a sample by how closely it matches each class/digit template. The classification errors may be explained by several factors: there are different styles of writing each digit, some digits are visually similar, and the training samples are not all identically sized, centered or aligned. For example, we can see how the digit 4 was mistaken for a 9 due to an elongated stroke.
2.2: The similarity between the interpretations: in both models, prediction is based on similarity of the sample to what was seen in the training data.
The differences are:
-In kNN we store the entire training set and predict from the labels of the most similar stored samples, while in the Linear Classifier model we build a general representation for each of the C class options and ultimately predict the classification based on which class template is most similar, not on specific samples.
-The Linear Classifier has a much smaller memory footprint, since it does not have to keep the whole training set in memory while predicting.
Explain your answer by describing what the loss graph would look like in the other two cases when training for the same number of epochs, and why.
display_answer(hw1.answers.part3_q3)
Your answer:
3.1: We think that the learning rate we chose is good. In the first 5 or so epochs the accuracy shows a general trend of improvement, with some local dips; this is important because we want to avoid getting stuck in local minima. The accuracy graph begins to flatten out after epoch 5, showing convergence to a minimum; this matters because with a learning rate that is too large we would keep "bouncing" in and out of minima, and the curves would oscillate instead of flattening. On the other hand, our step is large enough to actually reach a minimum; a learning rate that is too small would most likely converge too slowly to reach adequate results within the same number of epochs.
3.2: Based on the graph we conclude that the model is slightly overfitted to the training set: the accuracy on the training set is better than that of the validation set by a small margin.
$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial {#1}}{\partial {#2}}} $
In this part we'll perform the classic machine learning task of linear regression.
We'll do some simple data exploration and feature engineering,
like in the pre-deep-learning days.
Our solution will be implemented using some very widely used machine-learning python libraries
(numpy,
scikit-learn and
pandas).
We'll then explore the generalization capacity of the model and perform cross validation.
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 14})
np.random.seed(42)
test = unittest.TestCase()
We'll be working with the Boston housing dataset. This is a famous toy dataset for benchmarking regression algorithms.
The dataset contains 506 samples of median house values in Boston, each with 13 associated house and neighborhood attributes (features; see link for their meaning).
The 13 features of each house are our independent variables, and we're trying to predict the value of MEDV, the median house price (in units of $1000).
Run the following block to load the data. Since this dataset is very small, we can load it directly into memory and forgo any lazy-loading mechanisms.
import sklearn.datasets
import warnings
# Load data we'll work with - Boston housing dataset
# We'll use sklearn's built-in data
with warnings.catch_warnings():
warnings.simplefilter("ignore")
ds_boston = sklearn.datasets.load_boston()
feature_names = list(ds_boston.feature_names)
n_features = len(feature_names)
x, y = ds_boston.data, ds_boston.target
n_samples = len(y)
print(f'Loaded {n_samples} samples')
Loaded 506 samples
Let's use pandas to visualize the independent and target variables.
We'll just show the first 10 samples.
# Load into a pandas dataframe and show some samples
df_boston = pd.DataFrame(data=x, columns=ds_boston.feature_names)
df_boston = df_boston.assign(MEDV=y)
df_boston.head(10).style.background_gradient(subset=['MEDV'], high=1.)
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006320 | 18.000000 | 2.310000 | 0.000000 | 0.538000 | 6.575000 | 65.200000 | 4.090000 | 1.000000 | 296.000000 | 15.300000 | 396.900000 | 4.980000 | 24.000000 |
| 1 | 0.027310 | 0.000000 | 7.070000 | 0.000000 | 0.469000 | 6.421000 | 78.900000 | 4.967100 | 2.000000 | 242.000000 | 17.800000 | 396.900000 | 9.140000 | 21.600000 |
| 2 | 0.027290 | 0.000000 | 7.070000 | 0.000000 | 0.469000 | 7.185000 | 61.100000 | 4.967100 | 2.000000 | 242.000000 | 17.800000 | 392.830000 | 4.030000 | 34.700000 |
| 3 | 0.032370 | 0.000000 | 2.180000 | 0.000000 | 0.458000 | 6.998000 | 45.800000 | 6.062200 | 3.000000 | 222.000000 | 18.700000 | 394.630000 | 2.940000 | 33.400000 |
| 4 | 0.069050 | 0.000000 | 2.180000 | 0.000000 | 0.458000 | 7.147000 | 54.200000 | 6.062200 | 3.000000 | 222.000000 | 18.700000 | 396.900000 | 5.330000 | 36.200000 |
| 5 | 0.029850 | 0.000000 | 2.180000 | 0.000000 | 0.458000 | 6.430000 | 58.700000 | 6.062200 | 3.000000 | 222.000000 | 18.700000 | 394.120000 | 5.210000 | 28.700000 |
| 6 | 0.088290 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 6.012000 | 66.600000 | 5.560500 | 5.000000 | 311.000000 | 15.200000 | 395.600000 | 12.430000 | 22.900000 |
| 7 | 0.144550 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 6.172000 | 96.100000 | 5.950500 | 5.000000 | 311.000000 | 15.200000 | 396.900000 | 19.150000 | 27.100000 |
| 8 | 0.211240 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 5.631000 | 100.000000 | 6.082100 | 5.000000 | 311.000000 | 15.200000 | 386.630000 | 29.930000 | 16.500000 |
| 9 | 0.170040 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 6.004000 | 85.900000 | 6.592100 | 5.000000 | 311.000000 | 15.200000 | 386.710000 | 17.100000 | 18.900000 |
Let's explore the data a bit by plotting a scatter matrix of every variable as a function of every other and a histogram for each.
pd.plotting.scatter_matrix(df_boston, figsize=(20,20));
The above chart shows us (among other things) how our target variable MEDV behaves as a function
of the features (bottom row). By looking at it, can you guess which relationships might be good candidates for a linear model?
Let's use a simple method for deciding which features to use for our linear model: the correlation coefficient, defined as
$$ \rho_{\vec{x}\vec{y}} = \frac{\sigma_{\vec{x}\vec{y}}}{\sigma_{\vec{x}} \sigma_{\vec{y}}} = \frac {\sum_{i=1}^{N} (x_i - \mu_\vec{x}) (y_i - \mu_\vec{y}) } {\sqrt{\sum_{i=1}^{N} (x_i - \mu_\vec{x})^2} \cdot \sqrt{\sum_{i=1}^{N} (y_i - \mu_\vec{y})^2}} $$Where $\vec{x}, \vec{y}$ are $N$ samples of two variables, $\mu$ refers to the sample means, $\sigma_{\vec{x}}, \sigma_{\vec{y}}$ to the sample standard deviations, and $\sigma_{\vec{x}\vec{y}}$ to the sample covariance. The value of $\rho$ is $\pm 1$ for perfect positive or negative linear relationships ($y=ax+b$), and somewhere in between when the relationship is not perfectly linear. Note that this coefficient is rather limited: even when $\rho=0$, the variables may be highly dependent, just not in a linear fashion.
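The formula above can be sketched directly in numpy (our own illustrative helper; the actual assignment works on dataframe columns):

```python
import numpy as np

def pearson_corr(x: np.ndarray, y: np.ndarray) -> float:
    # rho = sum((x - mu_x)(y - mu_y)) / sqrt(sum((x - mu_x)^2) * sum((y - mu_y)^2))
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# A perfect negative linear relationship gives rho = -1
x_demo = np.arange(10, dtype=float)
rho = pearson_corr(x_demo, -3.0 * x_demo + 7.0)
```

Note that shifting or scaling either variable (here by $-3$ and $+7$) leaves $|\rho|$ unchanged, since the formula normalizes both mean and scale.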
Let's implement this method to find out which features we should include in our initial linear model.
TODO Implement the top_correlated_features() function in the hw1/linear_regression.py module.
import hw1.linear_regression as hw1linreg
n_top_features = 5
top_feature_names, top_corr = hw1linreg.top_correlated_features(df_boston, 'MEDV', n_top_features)
print('Top features: ', top_feature_names)
print('Top features correlations: ', top_corr)
# Tests
test.assertEqual(len(top_feature_names), n_top_features)
test.assertEqual(len(top_corr), n_top_features)
test.assertAlmostEqual(np.sum(np.abs(top_corr)), 2.893, delta=1e-3) # compare to precomputed value for n=5
Top features: ['LSTAT', 'RM', 'PTRATIO', 'INDUS', 'TAX']
Top features correlations: [-0.73766273 0.69535995 -0.50778669 -0.48372516 -0.46853593]
Arguably the simplest machine learning model is linear regression. We are given a dataset $\left\{\vec{x}^{(i)}, y^{(i)}\right\}_{i=1}^{N}$ where $\vec{x}^{(i)} \in \set{R}^D$ is a $D$-dimensional feature vector and $y^{(i)}\in\set{R}$ is a continuous quantity assumed to be the output of some unknown function, i.e. $y^{(i)} = f(\vec{x}^{(i)})$.
Our goal will be to fit a linear transformation, parametrized by weights vector and bias term $\vec{w}, b$, such that given a sample $\vec{x}$ our prediction is
$$ \hat{y} = \vectr{w}\vec{x} + b. $$We'll judge the performance of the model in the ordinary least-squares sense, i.e. with a loss function given by the mean-squared error (MSE) with the addition of an L2-regularization term: $$ L(\vec{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \frac{\lambda}{2}\norm{\vec{w}}^2_2 = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \vectr{w}\vec{x}^{(i)} - b \right)^2 + \frac{\lambda}{2}\norm{\vec{w}}^2_2. $$
Minimizing the above $L(\vec{w})$ is a simple convex optimization problem with a closed-form solution. Of course, this can also be solved using iterative descent methods which are necessary when the data is too large to fit in memory.
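As a hedged sketch of that closed form (with the bias folded into $\vec{w}$ via the bias trick): setting $\nabla_{\vec{w}} L = 0$ gives the regularized normal equations $(\frac{1}{N}\mat{X}^\top\mat{X} + \lambda\mat{I})\vec{w} = \frac{1}{N}\mat{X}^\top\vec{y}$. The helper below is our own illustration, not the required fit() implementation; note it regularizes the bias element too for simplicity, whereas one often excludes it from the penalty.

```python
import numpy as np

def fit_ridge_closed_form(X: np.ndarray, y: np.ndarray, reg_lambda: float) -> np.ndarray:
    # Solve (X^T X / N + lambda * I) w = X^T y / N for w
    N, D = X.shape
    A = X.T @ X / N + reg_lambda * np.eye(D)
    b = X.T @ y / N
    return np.linalg.solve(A, b)

# Sanity check: with noiseless data and lambda = 0 we recover the true weights
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
w_hat = fit_ridge_closed_form(X_demo, X_demo @ w_true, reg_lambda=0.0)
```

Using `np.linalg.solve` rather than explicitly inverting the matrix is both faster and numerically safer.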
As a warm-up with numpy, let's implement the bias trick again (this time using numpy and as an sklearn transformation)
so that our linear regression model will operate on data with an added bias term.
TODO Implement the class BiasTrickTransformer in the hw1/linear_regression.py module.
# Test BiasTrickTransformer
bias_tf = hw1linreg.BiasTrickTransformer()
test_cases = [
np.random.randint(10, 20, size=(5,2)),
np.random.randn(10, 1),
]
for xt in test_cases:
xb = bias_tf.fit_transform(xt)
print(xb.shape)
test.assertEqual(xb.ndim, 2)
test.assertTrue(np.all(xb[:,0] == 1))
test.assertTrue(np.all(xb[:, 1:] == xt))
(5, 3)
(10, 2)
Let's now define a function to assess the accuracy of our model's predictions (loss and score). We'll use the MSE loss as above and $R^2$ as a score. Note that $R^2$ is at most $1$ (and can even be negative when the fit is worse than simply predicting the mean); it represents how much better the regression fits the data compared to a simple average of the data. It is given by $$ R^2 = 1-\frac{\sum_i (e^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2}, $$ where $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ is known as the residual for each sample $i$ and $\bar{y}$ is the data mean.
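The two formulas translate directly into numpy; this is a sketch of the definitions above, not necessarily the exact implementation the module requires:

```python
import numpy as np

def mse_score(y: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean of the squared residuals e_i = y_i - yhat_i
    return float(np.mean((y - y_pred) ** 2))

def r2_score(y: np.ndarray, y_pred: np.ndarray) -> float:
    # 1 - SS_res / SS_tot: improvement relative to predicting the mean of y
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_demo = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r2_score(y_demo, y_demo)                            # perfect fit
r2_mean = r2_score(y_demo, np.full_like(y_demo, y_demo.mean()))  # mean predictor
```

A perfect fit gives $R^2 = 1$ and the constant mean predictor gives $R^2 = 0$, matching the interpretation above.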
TODO Implement the mse_score and r2_score function in the hw1/linear_regression.py module.
def evaluate_accuracy(y: np.ndarray, y_pred: np.ndarray):
"""
Calculates mean squared error (MSE) and coefficient of determination (R-squared).
:param y: Target values.
:param y_pred: Predicted values.
:return: A tuple containing the MSE and R-squared values.
"""
mse = hw1linreg.mse_score(y, y_pred)
rsq = hw1linreg.r2_score(y, y_pred)
return mse, rsq
Of course, these measures and many others are built into sklearn. We'll use those to test.
from sklearn.metrics import r2_score as r2, mean_squared_error as mse
for i in range(10):
test_y = np.random.randn(20)
test_y_pred = np.random.randn(20)
mse_actual, r2_actual = evaluate_accuracy(test_y, test_y_pred)
mse_expected, r2_expected = mse(test_y, test_y_pred), r2(test_y, test_y_pred)
test.assertAlmostEqual(mse_actual, mse_expected, delta=1e-6)
test.assertAlmostEqual(r2_actual, r2_expected, delta=1e-6)
Now we can implement our model.
TODO Based on the above equations for the model and loss, implement the predict() and fit()
functions in the LinearRegressor class within the module linear_regression.py.
You'll need to first derive the closed-form solution for the optimal $\vec{w}$ based on the loss.
Run the code block below to fit your model to each of the 5 top
features you selected (one at a time).
A very useful feature of sklearn is pipelines: We can create a composite model made of multiple
steps which transform the features (using fit_transform) and a final step which calculates the actual
model predictions (using fit_predict()). Each step in the pipeline should be an sklearn Estimator
instance and implement the appropriate methods.
For example, let's create a pipeline that scales each input feature to zero mean and unit variance, applies our bias-trick transformation and finally uses our Linear Regression model.
import sklearn.preprocessing
import sklearn.pipeline
# Create our model as a pipeline:
# First we scale each feature, then the bias trick is applied, then the regressor
model = sklearn.pipeline.make_pipeline(
sklearn.preprocessing.StandardScaler(),
hw1linreg.BiasTrickTransformer(),
hw1linreg.LinearRegressor(),
)
# Test the model implementation is correct
y_pred = model.fit_predict(x, y)
full_dataset_mse, _ = evaluate_accuracy(y, y_pred)
test.assertEqual(y_pred.shape, y.shape)
test.assertAlmostEqual(full_dataset_mse, 22.660, delta=1e-1)
From here on we'll use our pipeline as a model.
We now want to check the predictive power of different features. First, we'll implement a small helper function that will allow us to fit a model on a subset of features from our dataframe.
TODO Implement the fit_predict_dataframe function in the linear_regression.py module.
# Full dataset
y_pred = hw1linreg.fit_predict_dataframe(
model, df_boston, target_name='MEDV'
)
test.assertAlmostEqual(full_dataset_mse, evaluate_accuracy(y,y_pred)[0], delta=1e-1)
# Subset of features
y_pred = hw1linreg.fit_predict_dataframe(
model, df_boston, target_name='MEDV', feature_names=['CHAS', 'B']
)
test.assertAlmostEqual(72.982, evaluate_accuracy(y,y_pred)[0], delta=1e-1)
We'll use each feature separately and fit multiple times to get an idea of the predictive power of each of our top-5 features.
fig, ax = plt.subplots(nrows=1, ncols=n_top_features, sharey=True, figsize=(20,5))
actual_mse = []
# Fit a single feature at a time
for i, feature_name in enumerate(top_feature_names):
y_pred = hw1linreg.fit_predict_dataframe(model, df_boston, 'MEDV', [feature_name])
mse, rsq = evaluate_accuracy(y, y_pred)
# Plot
xf = df_boston[feature_name].values.reshape(-1, 1)
x_line = np.arange(xf.min(), xf.max(), 0.1, dtype=float).reshape(-1, 1)
y_line = model.predict(x_line)
ax[i].scatter(xf, y, marker='o', edgecolor='black')
ax[i].plot(x_line, y_line, color='red', lw=2, label=f'fit, $R^2={rsq:.2f}$')
ax[i].set_ylabel('MEDV')
ax[i].set_xlabel(feature_name)
ax[i].legend(loc='upper right')
actual_mse.append(mse)
# Test regressor implementation
print(actual_mse)
expected_mse = [38.862, 43.937, 62.832, 64.829, 66.040]
for i in range(len(expected_mse)):
test.assertAlmostEqual(expected_mse[i], actual_mse[i], delta=1e-1)
C:\Users\murad\miniconda3\envs\cs236781-hw\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names warnings.warn(
[38.86260846068978, 43.93789891484722, 62.83209551907833, 64.82947233954711, 66.0404346824534]
As you can see, the results are not great. We can't reliably predict the target variable based on just one of these. Now let's fit a model based on the combined top-5 features. Since it's difficult to visualize high-dimensional hyperplanes, instead of plotting the data and fitted hyperplane, we'll create a residuals plot. This is the plot of the error, or residual $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ vs. the predicted value $\hat{y}^{(i)}$.
# Fit top-5 features
y_pred = hw1linreg.fit_predict_dataframe(model, df_boston, 'MEDV', top_feature_names)
mse5, rsq5 = evaluate_accuracy(y, y_pred)
print(f'mse5={mse5:.2f}, rsq5={rsq5:.2f}')
# Residuals plot
def plot_residuals(y, y_pred, ax=None, res_label=None):
if ax is None:
_, ax = plt.subplots()
res = y - y_pred
    ax.scatter(y_pred, res, marker='s', edgecolor='black', label=res_label)
ax.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), color='red', lw=3)
ax.hlines(y=[-res.std(), res.std()], xmin=y_pred.min(), xmax=y_pred.max(), color='red', lw=3, linestyles=':')
ax.set_xlabel(r'$\hat{y}$')
ax.set_ylabel(r'$y - \hat{y}$')
if res_label is not None:
ax.legend()
return ax
plot_residuals(y, y_pred)
# Sanity test
test.assertLess(mse5, 30)
mse5=27.20, rsq5=0.68
That's better, but there's still more to be desired. Let's try to improve our model further.
We can see from the scatter matrix that some of the relationships between our features and the target variable are clearly not linear, and cannot be fully modeled by fitting lines (or hyperplanes). Is there a way to fit a non-linear function to the data (such as a polynomial) while still using the simplicity of the linear regression model?
Suppose we have 2-dimensional feature vectors, $\vec{x}=(x_1, x_2)$. We can fit a linear regression model with 3 parameters which represents some 2-d plane. However if we transform each such feature vector, for example by $\vec{\tilde{x}} = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, then we can now fit a model with 6 parameters to the same data. We can thus increase the capacity of our model (its ability to fit a wide variety of functions) by adding more parameters that correspond to non-linear transformations of the features.
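The degree-2 expansion above can be reproduced with sklearn's `PolynomialFeatures` (a minimal sketch; with `include_bias=False` and two input features, the generated columns are exactly $x_1, x_2, x_1^2, x_1 x_2, x_2^2$):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single 2-d feature vector (x1, x2) = (2, 3)
x = np.array([[2.0, 3.0]])

# Degree-2 expansion without the bias column:
# columns are x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
x_tilde = poly.fit_transform(x)
print(x_tilde)  # [[2. 3. 4. 6. 9.]]
```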
Let's implement some hand-crafted nonlinear features based on all the features in the dataset. This step in the machine learning process is sometimes also referred to as feature engineering. In the rest of the course, you'll see how Deep Learning allows us to learn the features themselves instead of creating them by hand, and thus creating very powerful representations.
TODO Implement the BostonFeaturesTransformer class in the hw1/linear_regression.py module.
You can create any features you want, for example given $\vec{x}=(x_1,x_2)$ you could generate features
such as $x_1^2$, $x_1 \log{x_2}$, $e^{-x_1}$ and so on.
Try to "engineer" features by inferring relationships based on the scatter matrix.
Notes:
- You can use PolynomialFeatures from sklearn.preprocessing to simplify generation of polynomial features.
def linreg_boston(model, x, y, fit=True):
if fit:
model.fit(x, y)
y_pred = model.predict(x)
mse, rsq = evaluate_accuracy(y, y_pred)
return y_pred, mse, rsq
# Fit with all features this time
x = df_boston[feature_names].values
# Use model with a custom features transform
model = sklearn.pipeline.make_pipeline(
hw1linreg.BiasTrickTransformer(),
hw1linreg.BostonFeaturesTransformer(),
hw1linreg.LinearRegressor()
)
y_pred, mse, rsq = linreg_boston(model, x, y)
plot_residuals(y, y_pred)
# Test: You should get at least 2x lower loss than previously, easily even lower
print(f'target_mse={mse5/2:.3f}')
print(f'mse={mse:.2f}, rsq={rsq:.2f}')
test.assertLess(mse, mse5 / 2)
target_mse=13.601
mse=7.53, rsq=0.91
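The custom transformers in the pipeline above come from the assignment's `hw1/linear_regression.py`; for reference, an sklearn-compatible transformer generally inherits from `BaseEstimator` and `TransformerMixin` and implements `fit` and `transform`. Below is an illustrative sketch of a bias-trick transformer, not the assignment's actual `BiasTrickTransformer` implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class BiasTrick(BaseEstimator, TransformerMixin):
    """Prepends a constant 1 to every sample (the 'bias trick').

    Hypothetical example class, for illustration only.
    """
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X)
        ones = np.ones((X.shape[0], 1))
        return np.hstack([ones, X])

X = np.arange(6.0).reshape(3, 2)
out = BiasTrick().fit_transform(X)
print(out.shape)  # (3, 3)
```

Because it implements the `fit`/`transform` protocol, such a class can be placed directly inside `sklearn.pipeline.make_pipeline`.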
By now, your model should produce fairly accurate predictions. Note however that we trained it on the entire Boston dataset.
When training models, we don't actually care about their performance on the training data; we're not interested in solving optimization problems. What we want is the ability to generalize: How well will it perform on novel, unseen data? In other words, did the model learn some function similar to the one actually generating the samples?
Let's find out how good our model is for unseen data the usual way: We'll split our dataset into a training and test set.
from sklearn.model_selection import train_test_split
# Data and model
x = df_boston[feature_names].values
y = df_boston['MEDV'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = sklearn.pipeline.make_pipeline(
hw1linreg.BiasTrickTransformer(),
hw1linreg.BostonFeaturesTransformer(),
hw1linreg.LinearRegressor()
)
However, instead of just fitting the model on the training set and evaluating on the test set, we'll use cross-validation to find a set of model hyperparameters that allow the model to generalize well.
We'll again use k-fold CV: the training set is split into k folds, and for each set of hyperparameters being tested, each fold in turn is held out as a validation set while the model is fitted to the remaining folds. This time, however, we have more hyperparameters to test.
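The search described above can be sketched with sklearn's `KFold` and `ParameterGrid`. This is a hypothetical outline on toy data, not the required `cv_best_hyperparams()` implementation; a plain `Ridge` model and its `alpha` parameter stand in for the assignment's model and hyperparameters:

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy data standing in for the training set
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

param_grid = ParameterGrid({'alpha': np.logspace(-3, 2, num=5)})
kf = KFold(n_splits=3, shuffle=True, random_state=0)

best_params, best_mse = None, np.inf
for params in param_grid:
    # Average validation MSE over the k folds for this candidate
    fold_mse = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(**params)
        model.fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    mse = np.mean(fold_mse)
    if mse < best_mse:
        best_params, best_mse = params, mse

print(best_params)
```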
TODO Implement the cv_best_hyperparams() function in the hw1/linear_regression.py module.
# Define search-spaces for hyper parameters
degree_range = np.arange(1, 4)
lambda_range = np.logspace(-3, 2, base=10, num=20)
# Use cross-validation to find best combination of hyperparameters
best_hypers = hw1linreg.cv_best_hyperparams(
model, x_train, y_train, k_folds=3,
degree_range=degree_range, lambda_range=lambda_range
)
print('Best hyperparameters: ', best_hypers)
# Make sure returned params exist in the model
for param in best_hypers.keys():
test.assertIn(param, model.get_params())
Best hyperparameters: {'bostonfeaturestransformer__degree': 2, 'linearregressor__reg_lambda': 0.23357214690901212}
Now lets use the best hyperparameters to train a model on the training set and evaluate it on the test set.
# Use the best hyperparameters
model.set_params(**best_hypers)
# Train best model on full training set
y_pred_train, mse, rsq = linreg_boston(model, x_train, y_train)
print(f'train: mse={mse:.2f}, rsq={rsq:.2f}')
ax = plot_residuals(y_train, y_pred_train, res_label='train')
# Evaluate on test set
y_pred_test, mse, rsq = linreg_boston(model, x_test, y_test, fit=False)
print(f'test: mse={mse:.2f}, rsq={rsq:.2f}')
ax = plot_residuals(y_test, y_pred_test, ax=ax, res_label='test')
# Make sure test-set accuracy is good
test.assertLess(mse, 20) # You should be able to get way below this
train: mse=7.04, rsq=0.92
test: mse=18.11, rsq=0.74
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
What's the ideal pattern to see in a residual plot? Based on the residual plots you got above, what can you say about the fitness of the trained model? Compare the plot for the top-5 features with the final plot after CV.
display_answer(hw1.answers.part4_q1)
Your answer: There are several factors that make an ideal pattern in a residual plot:
1. We would like the residuals to be as small as possible (closest to zero).
2. In a residual plot where we split the dataset into train and test sets, we would like to see similarly distributed residuals for the two subsets, so we know our model is able to generalize and is not overfitting to one of the subsets.
3. We would like to see a horizontal residual pattern, so we know that the model's performance is "equally good" for all values. That is, we want to avoid the case where our model is very good (small residual) for some values, yet very bad (large residual) for others.
Comparing the final model (after CV) to the one with the top-5 features, we see that the final model is better on all three criteria described above. Thus, we infer that the final model is a better model for our purpose and under our metrics.
Explain the effect of adding non-linear features to our data.
display_answer(hw1.answers.part4_q2)
Your answer:
Adding non-linear features to our model can be seen as projecting our samples from one space to another, such that in the new space a linear model can fit our samples better than the original space allowed (effectively enlarging our hypothesis space).
a. If the question is whether we can apply any non-linear function to the features to create new features for our linear regressor, then the answer is yes: any function that generates a new feature will work (although it will not necessarily improve the model).
b. If the question is whether we can approximate any non-linear function with our linear regressor (using the ability to create non-linear features), then the answer is also yes, although in many cases it would demand a lot of computation or preliminary knowledge about our data.
In the original feature space, the resulting decision boundary can be more complicated than just a hyperplane; for example, we could get a sphere, or multiple hyperplanes that partition the original space.
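The point about non-hyperplane boundaries can be seen with a small synthetic example: points inside vs. outside the unit circle are not linearly separable in $(x_1, x_2)$, but adding the lifted feature $r^2 = x_1^2 + x_2^2$ makes a single threshold, i.e. a hyperplane in the lifted space, separate them perfectly (a sketch with made-up data, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
# Label: 1 if the point lies inside the unit circle, else 0
labels = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)

# Lifted feature: squared radius. In (x1, x2, r2) space the two classes
# are separated by the hyperplane r2 = 1, i.e. a simple threshold.
r2 = X[:, 0]**2 + X[:, 1]**2
pred = (r2 < 1.0).astype(int)
acc = (pred == labels).mean()
print(acc)  # 1.0 -- perfectly separable once the non-linear feature is added
```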
Regarding the cross-validation:
Why did we use np.logspace instead of np.linspace? Explain the advantage for CV.
display_answer(hw1.answers.part4_q3)
Your answer:
Using np.logspace lets us sample values spanning several orders of magnitude with far fewer points: the samples are spread uniformly on a log scale, instead of clustered in a single magnitude range as linspace would've provided us. This enables us to find a good value faster.
As for the number of fits: the model was fitted (k_folds) × (number of degree values) × (number of lambda values) times. In this specific call, we have k_folds=3, 3 degree values and 20 lambda values, so we get 3 × 3 × 20 = 180 fits.
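Both points can be checked numerically: np.logspace spreads its samples across orders of magnitude where np.linspace spaces them uniformly, and the total number of fits follows directly from the search-space sizes (a small sketch using the values from this notebook):

```python
import numpy as np

# logspace covers 1e-3..1e2 with 20 samples spread evenly on a log scale;
# linspace over the same interval puts all but one sample above 1.
lam_log = np.logspace(-3, 2, base=10, num=20)
lam_lin = np.linspace(1e-3, 1e2, num=20)
print((lam_log < 1.0).sum(), (lam_lin < 1.0).sum())  # 12 1

# Total model fits during CV: one fit per fold per hyperparameter combination
k_folds, n_degrees, n_lambdas = 3, 3, 20
print(k_folds * n_degrees * n_lambdas)  # 180
```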